DATA1220-55, Fall 2024
2024-10-16
Describe the “shape” (i.e. distribution) of numerical variables
Calculate means, medians, modes, variances, standard deviations, IQRs
Learn the appropriate use of summary statistics (i.e. mean vs. median)
Characterize the relationship between 2 numerical variables
Analyze contingency (e.g. 2x2) tables
Summarize categorical variables with proportions
Compare numerical data between categorical groups
Recognize common visualization techniques / plots
Numerical: dot plots, histograms, density plots, box plots, violin plots
Categorical: bar plots, mosaic plots, tree maps
Build basic visualizations in R using ggplot2
Modality
Symmetry
Skew
Outliers
Summary Statistics
What is the modality of the distribution?
Unimodal: one peak
Bimodal: two peaks
Multimodal: many peaks
Uniform: no clear peak, flat distribution
Is the distribution symmetric or asymmetric?
Symmetric: “mirror image”, the distribution to the left of center looks like the distribution to the right of center
Asymmetric: left half looks different than the right half
If the distribution is asymmetric, is it because it’s skewed?
Does the distribution “lean” towards the left or the right?
Does the distribution have a long “tail” on one side but not the other?
Are there outliers in this distribution?
Are there any unusual data points?
How extreme are the most extreme values?
Outliers are rare
When data points are unusual but not rare, they create skew or additional modes rather than outliers
Is the distribution normal or does it require robust statistics?
When the distribution is very close to normal, the mean ± 1 SD describes the central ~68% of the data
The mean + SD are sensitive to modality, asymmetry, skew, and outliers
It’s never wrong to use the median + IQR, but when the distribution IS normal, the mean + SD are better
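As a quick check on how much of a normal distribution the mean and SD describe, the sketch below uses base R's `pnorm()` to compute the share of a normal distribution within 1, 2, and 3 standard deviations of the mean. The helper name `within_sd` is made up for this illustration.

```r
# Share of a normal distribution within k standard deviations of the mean
# (within_sd is a hypothetical helper name used only for this example)
within_sd <- function(k) pnorm(k) - pnorm(-k)

coverage <- sapply(1:3, within_sd)
round(coverage, 4)
```

The first value is about 0.68, which is where the "roughly two thirds within 1 SD" rule of thumb comes from.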
The median and interquartile range are considered to be robust statistics for the numerical summary of data because they are less sensitive to skew and outliers than the mean and standard deviation.
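To see what "less sensitive" means in practice, here is a minimal sketch that compares the four statistics before and after adding one extreme value. The scores are made-up numbers, and `summarize_scores` is a hypothetical helper, not something from the course materials.

```r
# Hypothetical exam scores (made-up values for illustration)
scores <- c(62, 65, 68, 70, 71, 73, 75, 78, 80)

# The same scores with one extreme value added
scores_outlier <- c(scores, 200)

# Hypothetical helper: mean, SD, median, and IQR in one vector
summarize_scores <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), iqr = IQR(x))
}

clean <- summarize_scores(scores)
contaminated <- summarize_scores(scores_outlier)
round(rbind(clean, contaminated), 2)
```

In this example the single extreme value pulls the mean from about 71.3 to 84.2, while the median only moves from 71 to 72.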
Minimum value or Q1 - 1.5 x Interquartile Range
1st quartile (Q1, 25th percentile)
Median (Q2, 50th percentile)
3rd quartile (Q3, 75th percentile)
Maximum value or Q3 + 1.5 x Interquartile Range
A boxplot is a visual representation of a 5-number summary. The “box” represents the middle 50% of the data, or the interquartile range (IQR). The line inside the box indicates the median or 50th percentile. The whiskers, the lines coming out from the box, extend to the most extreme data points that fall within 1.5 x IQR of Q1 and Q3. Values larger or smaller than that range are classified as potential outliers and appear as points.
The whiskers of a boxplot (the lines extending out from the box) reach at most 1.5 times the interquartile range beyond the box
Lower whisker limit: Q1 - 1.5 x IQR
Upper whisker limit: Q3 + 1.5 x IQR
If a point is outside this range, it is considered to be a potential outlier
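The 1.5 x IQR rule can be sketched in a few lines of base R. The data are made up, and the variable names are chosen only for this example. Note that `quantile()` and `boxplot()` compute quartiles slightly differently (`boxplot()` uses Tukey's hinges), so the limits drawn on a plot may differ a little from the ones calculated here.

```r
# Hypothetical data with one unusually large value
x <- c(12, 15, 16, 18, 19, 20, 21, 23, 24, 45)

q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Whisker limits from the 1.5 x IQR rule
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

# Points outside the limits are potential outliers
potential_outliers <- x[x < lower_limit | x > upper_limit]
potential_outliers
```

Here only the value 45 falls outside the limits, so it is the one point that would be drawn separately on a boxplot of these data.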
The median of this distribution is 72.7, and the mean of this distribution is 71.
The median of this distribution is 166, and the mean of this distribution is 161.9.
The median of this distribution is 72, and the mean of this distribution is 73.6.
The median of this distribution is 69, and the mean of this distribution is 67.5.
The median of this distribution is 418, and the mean of this distribution is 420.1.
The median of this distribution is 50, and the mean of this distribution is 48.4.
How to construct a contingency table with counts for 2 categorical variables.
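One way to build such a table in base R is with `table()`. This is a minimal sketch; the `survey` data frame and its `year` and `commute` columns are made up for illustration.

```r
# Hypothetical survey data (made-up values for illustration)
survey <- data.frame(
  year = c("First", "First", "First", "Second",
           "Second", "Second", "Second", "First"),
  commute = c("Walk", "Bus", "Walk", "Bus",
              "Bus", "Walk", "Bus", "Walk")
)

# Counts for each combination of the two categorical variables
counts <- table(survey$year, survey$commute)

# Add row and column totals
addmargins(counts)
```

Rows correspond to the first variable passed to `table()` and columns to the second.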
The row totals are all 1, which is the maximum value of a proportion. This indicates that the denominator for the proportions is the row total for each cell.
The column totals are all 1, which is the maximum value of a proportion. This indicates that the denominator for the proportions is the column total for each cell.
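The difference between row and column proportions can be sketched with base R's `prop.table()`, where `margin = 1` divides by row totals and `margin = 2` divides by column totals. The counts below are made up for illustration.

```r
# Hypothetical 2x2 table of counts (made-up values)
counts <- matrix(
  c(1, 3, 3, 1),
  nrow = 2,
  dimnames = list(year = c("First", "Second"), commute = c("Bus", "Walk"))
)

# Row proportions: each cell divided by its row total
row_props <- prop.table(counts, margin = 1)
rowSums(row_props)

# Column proportions: each cell divided by its column total
col_props <- prop.table(counts, margin = 2)
colSums(col_props)
```

Checking which set of totals equals 1 is a quick way to tell which denominator a table of proportions used.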
Define probability, random processes, and the law of large numbers
Describe the sample space for disjoint and non-disjoint outcomes
Calculate probabilities using the General Addition and Multiplication Rules
Create a probability distribution for disjoint outcomes
The sample space is the total collection of possible outcomes for a random process.
Die rolls: 1, 2, 3, 4, 5, 6
Coin flips: heads, tails
Stock market: up, down, no change
As more observations are collected, the sample statistic \(\hat{p}\) or \(\bar{x}\) of a particular outcome approaches the population proportion \(p\) or population mean \(\mu\) for that outcome.
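The law of large numbers can be illustrated with a small simulation. This sketch flips a fair coin 10,000 times and tracks the running proportion of heads. The seed and number of flips are arbitrary choices for this example.

```r
set.seed(1220)  # arbitrary seed so the simulation is reproducible

n_flips <- 10000
flips <- sample(c("heads", "tails"), size = n_flips, replace = TRUE)

# Running sample proportion of heads after each flip
running_prop <- cumsum(flips == "heads") / seq_len(n_flips)

# Sample proportion after 10, 100, 1000, and 10000 flips
running_prop[c(10, 100, 1000, 10000)]
```

The early proportions can be far from 0.5, but they tend to settle near the true probability as the number of flips grows.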
The probability of event A or event B occurring is the sum of the probability that A occurs and the probability that B occurs minus the probability that A and B both occur.
\[ \begin{aligned} P(A \operatorname{or} B) &= P(A \cup B) \\ &= P(A) + P(B) - P(A \operatorname{and} B) \\ &= P(A) + P(B) - P(A \cap B) \end{aligned} \]
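As a numeric check of the General Addition Rule, the sketch below builds a standard 52-card deck and compares P(diamond) + P(face card) - P(diamond and face card) with a direct count of cards that are a diamond or a face card. The card example is an illustration chosen here, not taken from the slides.

```r
# Build a standard 52-card deck
deck <- expand.grid(
  rank = c("A", 2:10, "J", "Q", "K"),
  suit = c("clubs", "diamonds", "hearts", "spades"),
  stringsAsFactors = FALSE
)

is_diamond <- deck$suit == "diamonds"
is_face <- deck$rank %in% c("J", "Q", "K")

p_diamond <- mean(is_diamond)        # 13/52
p_face <- mean(is_face)              # 12/52
p_both <- mean(is_diamond & is_face) # 3/52

# General Addition Rule
p_either_rule <- p_diamond + p_face - p_both

# Direct count of cards that are a diamond or a face card
p_either_count <- mean(is_diamond | is_face)

c(rule = p_either_rule, count = p_either_count)
```

Both approaches give 22/52. Without subtracting the overlap, the three diamond face cards would be counted twice.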
When events A and B are disjoint, the probability of event A or event B occurring is just the sum of the probability that A occurs and the probability that B occurs, because the probability that events A and B both occur is 0.
\[ \begin{aligned} P(A \operatorname{or} B) &= P(A \cup B) \\ &= P(A) + P(B) - P(A \cap B) \\ &= P(A) + P(B) \end{aligned} \]
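For disjoint outcomes the overlap term is 0, which a single die roll makes easy to see. This is a small sketch with a fair six-sided die, where the events "roll a 1" and "roll a 2" cannot happen together.

```r
die <- 1:6  # sample space for one roll of a fair die

p_one <- mean(die == 1)              # 1/6
p_two <- mean(die == 2)              # 1/6
p_both <- mean(die == 1 & die == 2)  # 0, the outcomes are disjoint

# Addition rule for disjoint outcomes
p_one_or_two <- p_one + p_two
p_one_or_two
```

The result is 2/6, the same as counting the two favorable outcomes directly.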
The probability of event A and event B occurring is the product of the probability that A occurs and the conditional probability that B occurs given that A has already occurred.
\[ \begin{aligned} P(A \operatorname{and} B) &= P(A \cap B) \\ &= P(A) \times P(B \operatorname{given} A) \\ &= P(A) \times P(B | A) \end{aligned} \]
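A common illustration of the General Multiplication Rule is drawing two cards without replacement. The sketch below, an example chosen here rather than one from the slides, computes the probability that both cards are aces and checks it against a count of two-card hands.

```r
# A: the first card is an ace
p_first_ace <- 4 / 52

# B given A: the second card is an ace, given the first was an ace
# (3 aces remain among 51 cards)
p_second_ace_given_first <- 3 / 51

# General Multiplication Rule
p_two_aces <- p_first_ace * p_second_ace_given_first

# Check: number of two-ace hands divided by number of two-card hands
p_check <- choose(4, 2) / choose(52, 2)

c(rule = p_two_aces, count = p_check)
```

Both values equal 1/221. The conditional probability 3/51 differs from 4/52 because the first draw changes what is left in the deck.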
When events A and B are independent, the probability of event A and event B occurring is the product of the probability that A occurs and the probability that B occurs, because the probability of B does not change based on the outcome of A.
\[ \begin{aligned} P(A \operatorname{and} B) &= P(A \cap B) \\ &= P(A) \times P(B \operatorname{given} A) \\ &= P(A) \times P(B | A) \\ &= P(A) \times P(B) \end{aligned} \]
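Two rolls of a fair die give a simple case where the outcome of the first event does not change the probability of the second. This sketch lists all 36 equally likely pairs of rolls and checks that P(second is 6 given first is 6) equals P(second is 6). The dice example is an illustration chosen here.

```r
# All 36 equally likely outcomes of rolling a fair die twice
rolls <- expand.grid(first = 1:6, second = 1:6)

p_first_six <- mean(rolls$first == 6)    # 1/6
p_second_six <- mean(rolls$second == 6)  # 1/6

# Conditional probability: second roll is 6 given the first roll was 6
p_second_six_given_first <- mean(rolls$second[rolls$first == 6] == 6)

# Multiplication rule when the two events do not affect each other
p_both_six <- p_first_six * p_second_six

# Direct count of outcomes where both rolls are 6
p_both_count <- mean(rolls$first == 6 & rolls$second == 6)

c(rule = p_both_six, count = p_both_count)
```

Both values equal 1/36, and the conditional probability matches the unconditional one, which is why P(B | A) can be replaced by P(B) here.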
A Z-score is the number of standard deviations a value falls above (when positive) or below (when negative) the mean of the data
Z-scores standardize a normal distribution by…
- Centering the data at 0 by subtracting the mean from each score
- Scaling the data to a standard deviation of 1 by dividing the centered data by the standard deviation
\[ \begin{aligned} Z &= \frac{\text{observed value} - \text{mean}}{\text{standard deviation}} \\ &= \frac{x-\mu}{\sigma} \end{aligned} \]
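The centering and scaling steps can be sketched in base R as follows. The heights are made-up values. Note one assumption: `sd()` computes the sample standard deviation (dividing by n - 1), which stands in for \(\sigma\) when only sample data are available.

```r
# Hypothetical heights in cm (made-up values for illustration)
heights <- c(150, 160, 165, 170, 180)

# Center by subtracting the mean, then scale by the standard deviation
centered <- heights - mean(heights)
z <- centered / sd(heights)

round(z, 2)

# The built-in scale() function performs the same two steps
z_builtin <- as.numeric(scale(heights))
```

After standardizing, the Z-scores have a mean of 0 and a standard deviation of 1, and a value equal to the mean (165 here) has a Z-score of 0.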
DATA1220-55 Fall 2024, Class 19 | Updated: 2024-10-16